Statistical analysis of three data sources for Covid-19 monitoring in Rhineland-Palatinate, Germany

In Rhineland-Palatinate, Germany, a system of three data sources has been established to track the Covid-19 pandemic. These sources are the number of Covid-19-related hospitalizations, the Covid-19 genecopies in wastewater, and the prevalence derived from a cohort study. This paper presents an extensive comparison of these parameters. We investigate whether wastewater data and a cohort study can be valid surrogate parameters for the number of hospitalizations and thus serve as predictors for coming Covid-19 waves. We observe that this is possible in general for the cohort study prevalence, while the variability of the wastewater data is too large to make quantitative predictions by a purely data-driven approach. However, the wastewater data and the cohort study prevalence are able to detect hospitalization waves in a qualitative manner. Furthermore, a detailed comparison of different normalization techniques of wastewater data is provided.


Data sources
Wastewater
Wastewater was sampled twice per week at 15 sewage plants in Rhineland-Palatinate as 24-h composite samples, following the same guidelines as were developed for the EU project ESI-CorA 19. Wastewater was collected time-, volume-, or flow-proportionally and prior to or after the sand filter, depending on the sewage plant. The samples were stored at +4 °C to +10 °C and processed within 48 h after collection. The laboratory followed the manufacturer's standard protocol for the Promega Maxwell® RSC Enviro Total Nucleic Acid Kit for extracting the viral information.
For each measurement, the number of N1 and N2 genecopies, respectively, was determined, which are the two gene targets of the SARS-CoV-2 virus that are commonly used in laboratory testing for Covid-19 1. In addition, a reference virus, the Pepper mild mottle virus (PMMoV), was measured. This is a plant RNA virus that is known to be a good reference in wastewater to be compared with other viruses 20.
Since wastewater occurrence of viruses may vary with the rainfall, the water volume of each measurement day at each sewage plant was measured. In addition, some wastewater-related parameters were collected to detect anomalies, concretely the water temperature, the pH value in water, the chemical oxygen demand, the water conductivity, and total organic carbon. The respective sewage plants together with the number of connected inhabitants and the plants' dry-weather flows are also available. To investigate possible weather-related influences, we added information on the air temperature and the precipitation that is available online from the German Meteorological Service 21. Since these parameters cannot be aggregated on federal state level, they are only used in "Finding 4: Wastewater data alone does not allow quantitative prediction of the cohort study prevalence".
While the wastewater surveillance in Rhineland-Palatinate is still ongoing, we use its measurement values between December 2022 and April 2023, thus for 5 months, corresponding to the availability of the other data sources. We have this data for each treatment facility individually. A publicly available extract of this data is available online 22. The participating sites include the five cities that were selected for the cohort study.

Cohort study
In 2022 and 2023, the University Medical Center of the Johannes Gutenberg-University Mainz conducted a cohort study ("SentiSurv") on behalf of the Ministry of Science and Health of Rhineland-Palatinate 23. A cohort was drawn randomly from the five largest cities in Rhineland-Palatinate, namely Mainz, Ludwigshafen, Koblenz, Trier, and Kaiserslautern. These cities are uniformly distributed across the federal state. The participants are almost representative in terms of gender and age except for children, who have been excluded for legal reasons. During the course of the first months, all five cohorts reached their targeted size of 2800 volunteers each. In addition to the regular completion of questionnaires, participants performed a rapid test on two fixed days each week and reported the outcome in a mobile app. Starting from January 2023, the number of participants included in the study was large enough to allow a statistically sound interpretation.
The results of the cohort study are available between January 2023 and April 2023, thus for 4 months. We have this data for each of the tested cities individually. The data has been displayed online during the course of the study 24.

Hospitalizations
As an official reference value, the number of Covid-19-related hospitalizations is used. This is the number of hospitalized persons with a positive Covid-19 test on a given day. Note that this differs from the commonly reported number of newly admitted patients with a positive test. A Covid-19 test was mandatory in hospitals in Rhineland-Palatinate until March 31, 2023, and therefore, the hospitalizations are available between December 2022 and March 2023, thus for 4 months. The data was collected on weekdays by the Ministerium für Wissenschaft und Gesundheit (Ministry of Science and Health) in Rhineland-Palatinate by calling the individual hospitals and asking for their number of positively tested patients. We only have these values for the whole of Rhineland-Palatinate, not broken down to cities. To the best of our knowledge, the numbers are not officially reported.

Rhineland-Palatinate
Rhineland-Palatinate is a federal state in the southwest of Germany. It has about 4 million inhabitants and five cities (Mainz, Ludwigshafen, Koblenz, Trier, Kaiserslautern) with more than 100,000 inhabitants. Those are the five cities in which the cohort study was conducted. The largest city is the state capital Mainz with around 220,000 inhabitants. In Rhineland-Palatinate, there are about 100 hospitals that contributed to the number of Covid-19-related hospitalizations described in "Hospitalizations".
Figure 1 shows a map of Rhineland-Palatinate and its location in Germany, which illustrates the distribution of the 15 sewage plants and the five cohort study cities. Furthermore, the map is shaded relative to the population density of the municipalities, ranging from 36 to 2226 people per square kilometer.

Parameter computation
From the collected parameters, some additional values could be derived. To appropriately depict the variation in N1 and N2 genecopies simultaneously, one may report their mean. Furthermore, for all three genecopies (N1, N2, and their mean), one can report their absolute value, their value per ml (i.e. normalized by the flow volume adjusted by the sewage plant's dry-weather flow), and their value in relation to the reference virus PMMoV. The latter value is computed as genecopies / PMMoV · 100,000. Three genecopies and three normalization techniques result in nine parameters that can be computed for the wastewater.
The nine numbers defined above (N1, N2, and mean, reported as absolute value, per ml, and per PMMoV, respectively) have to be normalized with the number of inhabitants in order to combine the values per sewage plant to summary values for the whole of Rhineland-Palatinate. For the three genecopies (N1, N2, and mean), this normalization happens in the same manner. Therefore, in the following formulas, the normalization is presented for the absolute value of genecopies, for the genecopies per ml, and the genecopies per PMMoV, where genecopies stands for N1, N2, or their mean. Since the total number of genecopies is an absolute value, one can compute

$$\text{Number of genecopies} = \frac{\sum_{i:\,\text{sewage plant}} \text{genecopies at } i}{\sum_{i:\,\text{sewage plant}} \text{inhabitants connected to } i}.$$

For the relative parameters, the numerator in the above formula would not necessarily increase with increasing number of inhabitants. This implies that sewage plants with small population may be over-represented. Therefore, for these parameters, different formulas hold, which are built from the per-plant relative values (genecopies/ml at i and genecopies per PMMoV at i) over the sewage plants i.
The resulting values are multiplied by 100,000 in order to report the values per 100,000 inhabitants.
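The aggregation of the absolute number of genecopies over sewage plants can be sketched in a few lines of Python. This is only an illustration of the ratio-of-sums formula; the function name `genecopies_per_100k` and the example numbers are hypothetical and not taken from the study (the original analysis was done in R).

```python
def genecopies_per_100k(plants):
    """Aggregate absolute genecopies over sewage plants, per 100,000 inhabitants.

    `plants` is a list of (genecopies, connected_inhabitants) tuples,
    one tuple per sewage plant for a single measurement day.
    """
    total_genecopies = sum(g for g, _ in plants)
    total_inhabitants = sum(n for _, n in plants)
    if total_inhabitants <= 0:
        raise ValueError("at least one connected inhabitant is required")
    # Ratio of sums, then scaled to 100,000 inhabitants.
    return total_genecopies / total_inhabitants * 100_000


# Hypothetical example values (not study data).
example_plants = [(4.0e6, 200_000), (1.0e6, 50_000)]
example_rate = genecopies_per_100k(example_plants)
```

Because numerator and denominator are summed separately, a large plant contributes in proportion to its connected population, which is the behavior described for the absolute parameter.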
For the cohort study, the daily prevalence is reported. This means that the number of positively tested participants is divided by the total number of participants. The resulting value is multiplied by 100,000 to obtain the prevalence per 100,000 inhabitants. The cohort study Covid tests were done on Sundays and Wednesdays. We denote the Sunday measurement as the first measurement and the Wednesday measurement as the second measurement of a week.
In order to match the hospitalization data to the two measurements per week, we compute the mean number of hospitalizations on Monday and Tuesday for the first measurement of a week and on Wednesday, Thursday, and Friday for the second measurement.
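The matching of weekday hospitalization counts to two values per week can be sketched as follows. The function name, the use of ISO calendar weeks as grouping key, and the example counts are assumptions made for illustration; how the Sunday cohort measurement is paired with the following Monday and Tuesday is left to the caller.

```python
from datetime import date
from statistics import mean


def hospitalizations_per_measurement(daily):
    """Reduce weekday hospitalization counts to two values per ISO week.

    `daily` maps datetime.date -> number of hospitalized persons with a
    positive test. Monday and Tuesday feed measurement 1; Wednesday,
    Thursday, and Friday feed measurement 2.
    """
    buckets = {}
    for day, count in daily.items():
        weekday = day.weekday()  # Monday == 0
        if weekday in (0, 1):
            slot = 1
        elif weekday in (2, 3, 4):
            slot = 2
        else:
            continue  # data were collected on weekdays only
        iso = day.isocalendar()
        buckets.setdefault((iso[0], iso[1], slot), []).append(count)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Hypothetical counts for the week starting Monday, 9 January 2023.
example_daily = {
    date(2023, 1, 9): 100,
    date(2023, 1, 10): 110,
    date(2023, 1, 11): 120,
    date(2023, 1, 12): 130,
    date(2023, 1, 13): 140,
}
example_matched = hospitalizations_per_measurement(example_daily)
```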

Statistics
The resulting parameters can be depicted as point plots. To measure the uncertainty in the data, confidence intervals for all values are needed. The prevalence can be interpreted as a rate and, therefore, Wilson confidence intervals for rates 26 can be computed and multiplied by 100,000. For continuous parameters, we compute the confidence interval for the mean by assuming a normal distribution. To this end, the standard deviation of the measurements has to be estimated. We do this by applying a time-shifted concept: for each measurement, we compute the standard deviation of the respective measurement and the two measurements before and after this measurement. For the hospitalizations, multiple values are summarized to compute one value per measurement (cf. "Parameter computation"). These values are used to compute the standard deviation for each measurement. Since the reported values cannot be negative, the lower bounds of the confidence intervals are cut at zero.
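As an illustration of the interval used for the prevalence, the following minimal sketch computes a Wilson score interval for a proportion and scales it to cases per 100,000. The confidence level is not stated in the text; the default `z = 1.96` (about 95%) and the function name are assumptions.

```python
from math import sqrt


def wilson_interval_per_100k(positives, participants, z=1.96):
    """Wilson score interval for a proportion, scaled to cases per 100,000.

    z = 1.96 corresponds to roughly 95% coverage (assumed, not from the paper).
    """
    if participants <= 0:
        raise ValueError("participants must be positive")
    p = positives / participants
    denom = 1 + z**2 / participants
    centre = (p + z**2 / (2 * participants)) / denom
    half_width = (
        z * sqrt(p * (1 - p) / participants + z**2 / (4 * participants**2)) / denom
    )
    lower = max(0.0, centre - half_width)  # a rate cannot be negative
    upper = min(1.0, centre + half_width)
    return lower * 100_000, upper * 100_000


# Hypothetical example: 140 positive tests among 14,000 participants.
example_low, example_high = wilson_interval_per_100k(140, 14_000)
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and remains usable when very few participants test positive.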
To illustrate time trends, a smoothing method is needed. We applied locally estimated scatterplot smoothing (LOESS) regression. For each data point, this method fits a linear regression in a pre-defined neighborhood of this point, i.e. on a certain proportion of the entire dataset around the data point, and predicts the point by this linear regression. By this neighborhood approach, a new linear model is fitted for each data point and thus a local smoothing is performed. The proportion of data points from the entire dataset that is used for the local linear regression is called span. In this paper, we used a span of 50% for the wastewater data and 60% for the comparison of all data due to the lower number of data points for the cohort study and hospitalization.
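The local-regression idea can be sketched in plain Python. This is a simplified stand-in, not R's `loess` implementation: it assumes a local linear fit with tricube weights over the nearest `span * n` points, and the function name is hypothetical.

```python
def loess_linear(x, y, span=0.5):
    """Local linear smoothing with tricube weights (simplified LOESS sketch).

    For every x0 in `x`, a weighted linear regression is fitted on the
    nearest span * n points and evaluated at x0.
    """
    n = len(x)
    k = max(2, min(n, int(round(span * n))))  # neighbourhood size
    fitted = []
    for x0 in x:
        # Bandwidth: distance to the k-th nearest point (guard against zero).
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Exactly linear toy data are reproduced by a local linear fit.
example_x = [float(i) for i in range(10)]
example_y = [2.0 * xi + 1.0 for xi in example_x]
example_fit = loess_linear(example_x, example_y, span=0.5)
```

A larger span averages over more neighbors and yields a smoother curve, which is consistent with choosing a larger span when fewer data points are available.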
In order to investigate if there is a time shift between different parameters, time lag correlations are computed between the raw values of these parameters. This means that one of the two parameters to be compared is shifted by a time step of l and the correlation between the two resulting time series is computed. The result is the correlation with time lag l. This approach will be used to compare if there is a time shift between prevalence, hospitalizations, and wastewater data.
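A time lag correlation can be sketched as below. The convention used here is an assumption: `leading[t]` is paired with `following[t + lag]`, with the lag counted in measurements and both series of equal length. The function names and toy series are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def lag_correlation(leading, following, lag):
    """Correlate leading[t] with following[t + lag] for a lag >= 0."""
    if len(leading) != len(following):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(leading) - 2:
        raise ValueError("lag must leave at least two overlapping points")
    shifted_leading = leading[: len(leading) - lag]
    shifted_following = following[lag:]
    return pearson(shifted_leading, shifted_following)


# Toy series: `example_following` repeats `example_leading` two steps later.
example_leading = [1, 3, 2, 5, 4, 6, 5]
example_following = [0, 0, 1, 3, 2, 5, 4]
example_r = lag_correlation(example_leading, example_following, 2)
```

Each additional lag step removes one overlapping pair, so correlations at large lags rest on fewer observations and should be read with more caution.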
We tried to use the available data to build prediction models in a data-driven manner, i.e. without the knowledge of any biological background. First, we investigated whether one can predict the number of hospitalizations by wastewater data and the cohort study prevalence. Second, we analyzed the predictive capability of the wastewater data to predict the cohort study prevalence. To this end, regression models were applied. We fitted random forests 27 to predict the outcome of interest. The number of trees in each forest was set to 5000 and the number of candidates at each split was set to 5. In order to avoid overfitting, leave-one-out cross-validation was applied. This means that for each measurement, a random forest was fitted on all other data points, but not this measurement, and the measurement was then predicted by the resulting random forest.
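The leave-one-out scheme can be sketched independently of the learner. In the sketch below, a 1-nearest-neighbor predictor is used purely as a lightweight stand-in for the random forest; `fit_1nn`, `predict_1nn`, and the toy data are hypothetical.

```python
def leave_one_out_predictions(X, y, fit, predict):
    """Predict every observation from a model fitted on all other observations."""
    predictions = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]  # drop observation i
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        predictions.append(predict(model, X[i]))
    return predictions


# Stand-in learner (NOT a random forest): 1-nearest neighbour.
def fit_1nn(X, y):
    return list(zip(X, y))


def predict_1nn(model, x):
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], x))
    return min(model, key=squared_distance)[1]


# Toy data: one feature, target equal to the feature.
example_X = [[0.0], [1.0], [2.0], [10.0]]
example_y = [0.0, 1.0, 2.0, 10.0]
example_loo = leave_one_out_predictions(example_X, example_y, fit_1nn, predict_1nn)
```

The held-out point is never part of its own training set, so the isolated value 10.0 is predicted from its nearest remaining neighbor rather than from itself.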
To make the scenario realistic, past values of the variables were included as covariates in the regression models. A past value is denoted by lag i when it refers to the time point i measurements before the current observation. Feature importance was computed from a random forest fitted on the full dataset to investigate which covariates influence the prediction notably. This feature importance was calculated as the factor by which the random forest's prediction error increases when the respective feature is shuffled within the underlying dataset. As error measure, the root mean squared error (RMSE) was used.
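The permutation-based feature importance described above can be sketched as an error ratio. The sketch assumes an already fitted model that is passed as a `predict` callable; the function names, the number of repetitions, and the toy model are hypothetical and do not reproduce the iml package.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, feature, n_repeats=20, seed=0):
    """Mean factor by which the RMSE grows when one feature column is shuffled."""
    rng = random.Random(seed)
    base_error = rmse(y, [predict(row) for row in X])
    if base_error == 0:
        raise ValueError("baseline error is zero; the ratio is undefined")
    ratios = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
        ratios.append(rmse(y, [predict(row) for row in X_perm]) / base_error)
    return sum(ratios) / n_repeats


# Toy model that uses feature 0 only; feature 1 is irrelevant.
def toy_predict(row):
    return 2.0 * row[0]


example_X = [[float(i), float((i * 7) % 5)] for i in range(8)]
example_y = [2.0 * i + (0.1 if i % 2 else -0.1) for i in range(8)]
importance_used = permutation_importance(toy_predict, example_X, example_y, 0)
importance_unused = permutation_importance(toy_predict, example_X, example_y, 1)
```

A ratio close to 1 means that shuffling the feature does not change the prediction error, whereas ratios well above 1 mark features the model relies on.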
Data was analyzed and visualized using the statistical software R 28, version 4.3.0, and the tidyverse packages 29. LOESS regression was done using R's standard function 'loess'. Random forests were fitted with the R package randomForest 30 and the R package iml 31 was applied for the calculation of feature importance.

Results
We present our results as four main findings that are given as the respective subchapter headings.

Finding 1: Trends in wastewater are present irrespective of genecopy type and normalization technique
Figure 2 shows the three genecopies (N1, N2, and their mean) together with three normalization techniques (absolute value, per ml, and per PMMoV).
Regarding trends, all nine panels show the same pattern with a small wave in late 2022 and a second, larger wave starting in February 2023. The N2 values are larger than the N1 values but show the same pattern. Normalizing the data by the flow or the reference virus compresses the data without changing the depicted trends. Therefore, normalizing the data seems to act as regularization. When normalizing the data by the flow or the PMMoV, the second wave reaches its peak before March 2023 while the absolute virus value continues increasing until mid-March 2023. This behavior is observed for N1, N2, and their mean analogously. After the peak, all curves decrease and show a moderate increase in the last measurements.

Finding 2: Qualitative trends are observed in all three data sources
Figure 3 compares the three data sources genecopies (normalized per ml), prevalence from the cohort study, and number of hospitalizations according to their availability described in "Data sources". There are two hospitalization waves. The first, slightly higher wave has its peak around New Year. After a low point in late January, the second wave has its peak in the first half of March. Due to missing data from 2022, the cohort study prevalence only captures the second wave. The peak of this wave is reached about 7 days earlier than in the hospitalizations. This indicates that the cohort study may be a good tool to detect hospitalization waves with some lead time. The wastewater data capture both waves but with different intensity. Here, the second peak is distinctly higher than the first one. Of note, the wastewater values show a much larger variance than the other two data sources. This complicates the interpretation of single values and the early detection of potential virus waves.

Finding 3: Cohort study prevalence and wastewater data allow prediction of hospitalizations
Figure 3 indicates that hospitalization waves may be detectable earlier in the wastewater and the cohort study prevalence, respectively. Table 1 shows the time lag correlation between the raw values of the three data sources (genecopies/ml in wastewater, cohort study prevalence, and number of hospitalizations). In each column, two of the sources are compared pairwise by computing their Pearson correlation under the given time lag. The time lag is defined such that the first mentioned value is shifted to the left by the respective time lag.
The genecopies per ml show a slight correlation with the number of hospitalizations 14 to 3 days (4 to 1 measurements) later. The highest correlation values are observed between the cohort study prevalence and the hospitalizations within a time difference of 0 to 7 days (0 to 1 measurements). Genecopies per ml and the prevalence are also positively correlated, but the prevalence seems to fit better to the hospitalizations than the genecopies per ml. Additionally, we investigated whether the number of hospitalizations can be predicted by the cohort study prevalence, raw values of the genecopies, and the genecopies per ml, including the past three values of these parameters. As described in "Statistics", random forests with leave-one-out cross-validation were fitted.
Figure 4 shows the results of the hospitalization prediction. In the left panel, the true numbers of hospitalizations are plotted together with the random-forest-predicted hospitalizations. The prediction is able to detect the wave in the first weeks of March, including the increase before and the decrease after the peak. The predicted values are slightly compressed compared with the true values. In the right panel, the feature importances are shown. The three most important parameters are the cohort study prevalences zero to two measurements (thus 0 to 10 days) before the current measurement. This corresponds with the findings of Fig. 3 and Table 1 that there is a certain time shift between the prevalence and the number of hospitalizations. The genecopies in wastewater do not influence the outcome considerably.

Finding 4: Wastewater data alone does not allow quantitative prediction of the cohort study prevalence
As observed in "Finding 3: Cohort study prevalence and wastewater data allow prediction of hospitalizations", the cohort study prevalence is a good predictor of the number of hospitalizations and, therefore, it seems to be a good indicator to investigate the development of the pandemic. Since it is less effort to test the wastewater than to conduct a cohort study, we investigated whether the cohort study prevalence can be predicted from the wastewater data. For this prediction, additional parameters such as air or water temperature could also be used. This is because the prediction is done at the city level, whereas these parameters cannot be aggregated at the federal state level (cf. "Wastewater"). We did the prediction for the cities of Koblenz, Kaiserslautern, and Mainz since for those, the investigated wastewater parameters are available. The prevalence was predicted for each city individually and the prevalence estimator for Rhineland-Palatinate was computed by combining the individual prevalences. As described in "Statistics", random forests with leave-one-out cross-validation were fitted.
Figure 5 shows the results. The left panel compares the true prevalence with the random-forest-predicted prevalence. The cohort study prevalence cannot be estimated well from the wastewater data. In particular, the predicted prevalence appears to be quite constant and the peak in late February is not detected. In the right panel, the feature importances are depicted. The most relevant parameters are the genecopies per ml at the same measurement, the genecopies per ml two measurements before, and the absolute number of genecopies at the last measurement. Apart from the Covid-19 genecopies in wastewater, the most relevant parameters seem to be the air temperature and the flow.

Table 1. Pairwise time lag correlation between the data sources. Each column shows a pairwise comparison of two of the three sources. In the rows, the correlations for the respective time lag are given. Time lag is given in measurements.

Discussion
In this paper, we described the different Covid-19 surveillance techniques that were applied in Rhineland-Palatinate, Germany. We investigated how the sources wastewater data, number of hospitalizations, and prevalence of a cohort study correlate and how they can be used to track the progress of the pandemic. We demonstrated that trends in wastewater are present for all three common normalization techniques (genecopies absolute, genecopies per ml, genecopies per reference virus) and that, in a qualitative manner, pandemic waves can be detected in all three data sources. Furthermore, it was shown that in particular the cohort study prevalence may be well suited to build a data-driven prediction model for the future number of Covid-19-related hospitalizations. While, to the best of our knowledge, this is the first publication comparing all three data sources, it confirms the finding that qualitative pandemic trends can be detected in wastewater 3,4,6.
Some limitations of these observations have to be noted. First, there is typically a time delay in collecting wastewater samples, analyzing the genecopies, and collecting the results of the cohort study participants. As a result, the data is usually not available immediately. To develop a valid prediction tool, the development of a fast data collection process would be essential. Second, the Omicron variant was the dominating variant during our entire observation period. During a pandemic, different variants dominate for certain periods of time. These variants may produce more or fewer genecopies in wastewater and lead to more or fewer hospitalizations. Consequently, a regular variant sequence analysis would be a necessary complement to an established pandemic observation tool built on wastewater. Of note, from February on, the dominating subvariant changed from BQ.1 to XBB.1. This suggests that the presented methodology is at least stable with regard to these subvariants. Third, the observed trends are mainly detectable due to the applied smoothing technique. Such techniques are not valid at the boundaries of the observed time period and thus not well suited for extrapolation and prediction. A mathematical model that uses biological information might be a valid alternative to predict the future progress of the pandemic.
When setting up such a framework, different challenges have to be taken into account. For the wastewater data, wastewater has to be collected in sewage plants and transported to and analyzed by a laboratory. For the cohort study, one has to build a medical team, select and recruit the participants, send the tests, and establish a feedback loop. The resulting data has to be analyzed by statisticians. All these steps incur costs of time, money, and resources. Regarding the financial costs, the cohort study is more than twice as expensive as the wastewater data collection.
There are three possible extraction techniques for taking wastewater samples. One can extract time-proportionally, flow-proportionally, or volume-proportionally 32. To obtain the most representative sample, it is recommended to apply volume- or flow-proportional extraction 7. However, in our sample, all three extraction techniques were applied. To establish a regular tool for wastewater surveillance, it may be sound to apply the same extraction technique among different sewage plants and to avoid time-proportional extraction.
We aggregated the available data to the level of the federal state Rhineland-Palatinate, since the hospitalization data was only available on this level. However, the cohort study was performed only in the five largest cities, i.e. only in urban areas and only on adults. Although these cities represent 713,000 inhabitants, i.e. 17% of the total population, and are geographically spread around the state, there may be structural differences in the infection characteristics that lead to a bias in the measured prevalence.
The wastewater treatment plants cover a size of 15,300 to 250,000 connected inhabitants, i.e., they are chosen from both rural and urban areas. In total, 1,347,318 inhabitants are connected to the sewage plants, corresponding to a proportion of 33% of the total population of Rhineland-Palatinate. However, it is also not proven that these plants are perfectly representative of the federal state.
While we treat the hospitalization data as the gold standard for the burden of disease in the population, the hospitalization data only measures hospitalized people with a positive test and not necessarily patients hospitalized because of Covid-19. It is likely that other (respiratory) diseases such as influenza led to a disproportionate rate of hospital admissions, and since we measure the absolute number of positive patients, we would then see higher levels despite potentially stable disease rates.
When analyzing the information in the wastewater data by introducing other parameters than genecopies, we focused on prediction on sewage plant level and aggregated the predictions to one value for Rhineland-Palatinate afterwards (cf. "Finding 4: Wastewater data alone does not allow quantitative prediction of the cohort study prevalence"). This was done since parameters such as the air temperature or the pH value in water can hardly be summarized to one joint value among different cities. We observed a quite volatile cohort study prevalence on sewage plant level. This complicates the estimation of the prevalence and makes valid assertions on the benefit of wastewater data more difficult.
While in our data, an exact prediction of the cohort study prevalence by wastewater data without incorporating biological models was not possible, this does not imply that collecting wastewater data is senseless. Wastewater data may not be usable for a quantitative prediction of the pandemic development, but it may serve as a qualitative predictor. This means that an increase of the viral load in wastewater indicates a worsening of the pandemic situation. Furthermore, wastewater data can be gene-sequenced, in contrast to a cohort study that measures positive Covid-19 tests. This implies that wastewater systems can also be used to test for new variants or other viruses. There already exist approaches for influenza virus 33,34 and reviews on different viruses, including, among others, Hepatitis A viruses and noroviruses 35,36. Since wastewater is easy and inexpensive to collect, it may develop into an increasingly important tool to track the spread of different viruses in the population. The investigation of wastewater surveillance for further diseases is an attractive topic for future research.

Figure 1. Location of Rhineland-Palatinate in Germany and distribution of sewage plants (blue dots), cohort study cities (orange stars), and population density (grey shading). We created the maps ourselves with the python package geopandas 25. The maps are both based upon open data sources, which we retrieved from http://opendatalab.de/projects/geojson-utilities/ which itself uses geodata from the German Federal Agency for Cartography and Geodesy https://gdz.bkg.bund.de/ and population data from the Federal Statistical Office of Germany https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/_inhalt.html. The coordinates of the referenced cities were manually collected from Google Maps.

Figure 2. Wastewater values (mean, N1, and N2) depending on the normalization technique. Main trends are detectable in all types of genecopies and all normalization techniques. Normalizing by flow or PMMoV regularizes the curve.

Figure 3. Comparison of prevalence, genecopies, and hospitalizations. The main wave is observable in all three parameters.

Figure 4. Hospitalization prediction and corresponding feature importance derived by a random forest. The number of hospitalizations can be predicted well. The most important predictors are the cohort study prevalences. The term 'lag' denotes the measurement 'lag' measurements before the current measurement.